Method and system for storing snapshots in hyper-converged infrastructure

ABSTRACT

In an embodiment of the present disclosure, a processor receives a request to create a volume, wherein a volume placement policy comprises a plurality of scheduler algorithms, each of the scheduler algorithms selecting one or more worker nodes from a plurality of worker nodes for volume storage, determines, based on output from the plurality of scheduler algorithms, the one or more worker nodes, wherein output of each of the plurality of scheduler algorithms is assigned a weight in determining the one or more worker nodes; and causes a node agent in each of the one or more worker nodes to create the volume.

CROSS REFERENCE TO RELATED APPLICATION

The present application claims the benefits of U.S. Provisional Application Ser. No. 63/114,295, filed Nov. 16, 2020, entitled “Method and System for Managing Cloud Resources” and U.S. Provisional Application Ser. No. 63/194,611, filed May 28, 2021, entitled “Method and System for Storing Snapshots in Hyper-Converged Infrastructure,” both of which are incorporated herein by this reference in their entirety.

FIELD

The invention relates generally to distributed processing systems and particularly to cloud computing systems.

BACKGROUND

Storage snapshots are point-in-time copies of volumes and an integral part of a data protection strategy. A snapshot can be a full copy of the volume, or it can be space-optimized and share unmodified blocks with the volume. In a hyper-converged infrastructure, volume data is located on direct-attached drives of compute or worker nodes. Multiple copies of each data block are maintained across nodes for mirrored volumes. Mirrored volumes provide data redundancy in the event of a drive or node failure.

Space-optimized snapshots are co-located with their parent volumes since unmodified blocks are shared. For example, a snapshot for an N-way mirrored volume can be non-mirrored or have 1 to N-1 additional mirrors.
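By way of non-limiting illustration, the block sharing described above can be sketched as follows. The sketch is a toy in-memory model only: the class name Volume, its methods, and the sample block contents are hypothetical and are not taken from any embodiment. It shows that a snapshot copies only the block map, so blocks left unmodified after the snapshot remain shared with the parent volume, which is why such a snapshot resides with its parent.

```python
class Volume:
    """Toy block volume whose snapshots share unmodified blocks (hypothetical model)."""

    def __init__(self, num_blocks):
        # Each slot holds a reference to an immutable bytes object.
        self.blocks = [b""] * num_blocks

    def write(self, index, data):
        # Rebinding the slot leaves any snapshot's reference to the old block intact.
        self.blocks[index] = bytes(data)

    def snapshot(self):
        # Copy only the block map (the references), not the block contents.
        return tuple(self.blocks)


vol = Volume(4)
vol.write(0, b"alpha")
vol.write(1, b"beta")
snap = vol.snapshot()
vol.write(1, b"gamma")  # modify one block after the snapshot was taken

# Indices of blocks that the snapshot and the live volume still share.
shared = [i for i, blk in enumerate(snap) if blk is vol.blocks[i]]
```

In this sketch only block 1 diverges after the write; the snapshot still reads the earlier contents of that block, while every other block is held once and referenced by both the volume and the snapshot.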

SUMMARY

In certain embodiments, the present disclosure relates to a method that includes the steps of: receiving a request to create a volume in accordance with a volume placement policy that comprises a plurality of scheduler algorithms, each of which selects one or more worker nodes from a plurality of worker nodes for volume storage; determining, based on output from the plurality of scheduler algorithms, the one or more worker nodes, with an output of each of the plurality of scheduler algorithms being assigned a weight in determining the one or more worker nodes; and causing, by the processor, a node agent in each of the one or more worker nodes to create the volume.
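By way of non-limiting illustration, one possible way to combine weighted scheduler outputs is sketched below. The two scheduler functions, the node attributes free_gb and volumes, the weights, and the tie-breaking rule are assumptions introduced for the example; the disclosure does not prescribe this particular combination rule.

```python
from collections import defaultdict


def most_free_space(nodes):
    # Hypothetical scheduler algorithm: select the node with the most free capacity.
    return [max(nodes, key=lambda n: nodes[n]["free_gb"])]


def fewest_volumes(nodes):
    # Hypothetical scheduler algorithm: select the node hosting the fewest volumes.
    return [min(nodes, key=lambda n: nodes[n]["volumes"])]


def select_nodes(nodes, weighted_schedulers, count):
    """Sum the weight of every scheduler that selected a node, then take the top nodes."""
    scores = defaultdict(float)
    for scheduler, weight in weighted_schedulers:
        for name in scheduler(nodes):
            scores[name] += weight
    # Highest score first; ties are broken by node name for determinism.
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return ranked[:count]


nodes = {
    "node-a": {"free_gb": 500, "volumes": 9},
    "node-b": {"free_gb": 200, "volumes": 2},
    "node-c": {"free_gb": 350, "volumes": 5},
}
policy = [(most_free_space, 0.7), (fewest_volumes, 0.3)]
chosen = select_nodes(nodes, policy, count=1)
```

Here the free-space scheduler carries the larger weight, so its selection prevails when only one node is requested. In a fuller implementation the node agent on each chosen node would then be instructed to create the volume.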

In some embodiments, a server includes: a communication interface to transmit and receive communications; a processor coupled with the communication interface; and a computer readable medium, coupled with and readable by the processor and storing therein a set of instructions. The set of instructions, when executed by the processor, causes the processor to: receive a request to create a volume, wherein a volume placement policy comprises a plurality of scheduler algorithms, each of which selects one or more worker nodes from among a plurality of worker nodes for volume storage; determine, based on an output from the plurality of scheduler algorithms, the one or more worker nodes, wherein an output of each of the plurality of scheduler algorithms is assigned a weight in determining the one or more worker nodes; and instruct a node agent in each of the one or more worker nodes to create the volume.

In some embodiments, a method includes the steps of: receiving, by a processor, a request to create a volume in accordance with a volume placement policy that comprises a plurality of scheduler algorithms, each of which selects one or more clusters of worker nodes from among a plurality of worker node clusters corresponding to a common tenant for volume storage; determining, by the processor based on an output from the plurality of scheduler algorithms, the one or more worker node clusters, with an output of each of the plurality of scheduler algorithms being assigned a weight in determining the one or more worker node clusters; and causing, by the processor, a node agent in each of the one or more worker node clusters to create the volume on a worker node in each of the one or more worker node clusters.

In some embodiments, a server includes: a communication interface to transmit and receive communications; a processor coupled with the communication interface; and a computer readable medium, coupled with and readable by the processor and storing therein a set of instructions. The set of instructions, when executed by the processor, causes the processor to: receive a request to create a volume in accordance with a volume placement policy that comprises a plurality of scheduler algorithms, each of which selects one or more clusters of worker nodes from among a plurality of worker node clusters corresponding to a common tenant for volume storage; determine, based on an output from the plurality of scheduler algorithms, the one or more worker node clusters, with an output of each of the plurality of scheduler algorithms being assigned a weight in determining the one or more worker node clusters; and cause a node agent in each of the one or more worker node clusters to create the volume on a worker node in each of the one or more worker node clusters.

The present invention can provide a number of advantages depending on the particular configuration. For example, the present disclosure can provide a flexible architecture to allow users to configure their snapshot volume placement depending on their needs. Users can have a number of policies and can choose among them at the time of snapshot volume creation.

These and other advantages will be apparent from the disclosure of the invention(s) contained herein.

The phrases “at least one”, “one or more”, and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B and C”, “at least one of A, B, or C”, “one or more of A, B, and C”, “one or more of A, B, or C” and “A, B, and/or C” means: A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.

The term “a” or “an” entity refers to one or more of that entity. As such, the terms “a” (or “an”), “one or more” and “at least one” can be used interchangeably herein. It is also notable that the terms “comprising”, “including”, and “having” can be used interchangeably.

The term “application containerization” may be used to refer to an operating system-level virtualization method that deploys and runs distributed applications or virtualized applications (e.g., containerized or virtual machine-based applications) without launching an entire virtual machine for each application. Multiple isolated applications or services may run on a single host and access the same operating system kernel.

The term “automatic” and variations thereof may refer to any process or operation done without material human input when the process or operation is performed. However, a process or operation can be automatic, even though performance of the process or operation uses material or immaterial human input, if the input is received before performance of the process or operation. Human input is deemed to be material if such input influences how the process or operation will be performed. Human input that consents to the performance of the process or operation is not deemed to be “material”.

The term “computer-readable medium” may refer to any tangible storage and/or transmission medium that participates in providing instructions to a processor for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, NVRAM, or magnetic or optical disks. Volatile media includes dynamic memory, such as main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, magneto-optical medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, a solid state medium like a memory card, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read. A digital file attachment to e-mail or other self-contained information archive or set of archives is considered a distribution medium equivalent to a tangible storage medium. When the computer-readable medium is configured as a database, it is to be understood that the database may be any type of database, such as relational, hierarchical, object-oriented, and/or the like. Accordingly, the invention is considered to include a tangible storage medium or distribution medium and prior art-recognized equivalents and successor media, in which the software implementations of the present invention are stored.

The term “cluster” may refer to a group of multiple worker nodes that deploy, run and manage containerized or VM-based applications and a master node that controls and monitors the worker nodes. A cluster can have an internal and/or external network address (e.g., DNS name or IP address) to enable communication between containers or services and/or with other internal or external network nodes.

The term “container” may refer to a form of operating system virtualization that enables multiple applications to share an operating system by isolating processes and controlling the amount of processing resources (e.g., central processing unit (CPU), graphics processing unit (GPU), etc.), memory, and disk those processes can access. While containers, like virtual machines, share common underlying hardware, containers, unlike virtual machines, share an underlying, virtualized operating system kernel and do not run separate operating system instances.

The terms “determine”, “calculate” and “compute,” and variations thereof are used interchangeably and include any type of methodology, process, mathematical operation or technique.

The term “deployment” may refer to control of the creation, state and/or running of containerized or VM-based applications. It can specify how many replicas of a pod should run on the cluster. If a pod fails, the deployment may be configured to create a new pod.
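By way of non-limiting illustration, the replica-maintaining behavior of a deployment can be sketched as a single reconciliation pass. The function name reconcile and the dictionary representation of a pod are hypothetical; an actual container orchestrator performs this through its own controllers rather than through this code.

```python
def reconcile(desired_replicas, pods):
    """Return the pod list after one reconciliation pass (hypothetical model)."""
    # Keep only the pods that are still running.
    healthy = [p for p in pods if p["status"] == "Running"]
    # New pods receive identifiers that have not been used before.
    next_id = max((p["id"] for p in pods), default=0) + 1
    # Create replacement pods until the desired replica count is met.
    while len(healthy) < desired_replicas:
        healthy.append({"id": next_id, "status": "Running"})
        next_id += 1
    # Trim any surplus pods if more are running than desired.
    return healthy[:desired_replicas]


pods = [
    {"id": 1, "status": "Running"},
    {"id": 2, "status": "Failed"},
    {"id": 3, "status": "Running"},
]
after = reconcile(3, pods)
```

In this sketch the failed pod is dropped and one new pod is created, so the replica count specified by the deployment is restored.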

The term “domain” may refer to a set of objects that define the extent of all infrastructure under management within a single context. Infrastructure may be physical or virtual, hosted on-premises or in a public cloud. Domains may be configured to be mutually exclusive, meaning there is no overlap between the infrastructure within any two domains.

The term “domain cluster” may refer to the primary management cluster. This may be the first cluster provisioned.

The term “Istio service mesh” may refer to a service mesh layer for containers that adds a sidecar container to each cluster that configures, monitors, and manages interactions between the other containers.

The term “Knative” may refer to a platform that sits on top of containers and enables developers to build a container and run it as a software service or as a serverless function. It can enable automatic transformation of source code into a clone container or functions; that is, Knative may automatically containerize code and orchestrate containers, such as by configuration and scripting (such as generating configuration files, installing dependencies, managing logging and tracing, and writing continuous integration/continuous deployment (CI/CD) scripts). Knative can perform these tasks through build (which transforms stored source code from a prior container instance into a clone container or function), serve (which runs containers as scalable services and performs configuration and service routing), and event (which enables specific events to trigger container-based services or functions).

The term “master node” may refer to the node that controls and monitors worker nodes. The master node may run a scheduler service that automates when and where containers are deployed based on developer-set deployment requirements and available computing capacity.

The term “module” may refer to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and software that is capable of performing the functionality associated with that element. Also, while the invention is described in terms of exemplary embodiments, it should be appreciated that individual aspects of the invention can be separately claimed.

The term “namespace” may refer to a set of signs (names) that are used to identify and refer to objects of various kinds. In Kubernetes, for example, there are three primary namespaces: default, kube-system (used for Kubernetes components), and kube-public (used for public resources). Namespaces are intended for use in environments with many users spread across multiple teams, or projects. Namespaces may not be nested inside one another, and each Kubernetes resource may be configured to only be in one namespace. Namespaces may provide a way to divide cluster resources between multiple users (via resource quota). The extension of namespaces in the present disclosure is discussed at page 9 of Exhibit “A”. At a high level, the extension to namespaces enables multiple virtual clusters (or namespaces) backed by a common set of physical (Kubernetes) clusters.

The term “pods” may refer to groups of containers that share the same compute resources and the same network.

The term “project” may refer to a set of objects within a tenant that contains applications. A project may act as an authorization target and allow administrators to set policies around sets of applications to govern resource usage, cluster access, security levels, and the like. The project construct can enable authorization (e.g., Role Based Access Control or RBAC), application management, and the like within a project. In one implementation, a project is an extension of Kubernetes' use of namespaces for isolation, resource allocation and basic authorization on a cluster basis. Project may extend the namespace concept by grouping together multiple namespaces in the same cluster or across multiple clusters. Stated differently, projects can run applications on one cluster or on multiple clusters. The resources are allocated per project basis.

The term “project administrator” or “project admin” or PA may refer to the entity or entities responsible for adding members to a project, managing users of a project, managing applications that are part of a project, specifying new policies to be enforced in a project (e.g., with respect to uptime, SLAs, and overall health of deployed applications), etc.

The term “project member” or PM may refer to the entity or entities responsible for deploying applications on Kubernetes in a project, responsible for uptime, service level agreements (“SLAs”), and overall health of deployed applications. The PM may not have permission to add a user to a project.

The term “project viewer” or PV may refer to the interface that enables a user to view all applications, logs, events, and other objects in a project.

The term “resource”, when used with reference to Kubernetes, may refer to an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind; for example, the built-in pods resource contains a collection of pod objects.

The term “serverless computing” may refer to a way of deploying code that enables cloud native applications to bring up the code as needed; that is, it can scale it up or down as demand fluctuates and take the code down when not in use. In contrast, conventional applications deploy an ongoing instance of code that sits idle while waiting for requests.

The term “service” may refer to an abstraction, which defines a logical set of pods and a policy by which to access them (sometimes this pattern is called a micro-service).

The term “service provider” or SP may refer to the entity that manages the physical/virtual infrastructure in domains. In one implementation, a service provider manages an entire node inventory and tenant provisioning and management. Initially a service provider manages one domain.

The term “service provider persona” may refer to the entity responsible for hardware and tenant provisioning or management.

The term “snapshot” may refer to a point-in-time copy of a volume. A snapshot can be used either to provision a new volume (pre-populated with the snapshot data) or to restore an existing volume to a previous state (represented by the snapshot).

The term “tenant” may refer to an organizational construct or logical grouping used to represent an explicit set of resources (e.g., physical infrastructure (e.g., CPUs, GPUs, memory, storage, network), cloud clusters, people, etc.) within a domain. Tenants “reside” within infrastructure managed by a service provider. By default, individual tenants do not overlap or share anything with other tenants; that is, each tenant can be data isolated, physically isolated, and runtime isolated from other tenants by defining resource scopes devoted to each tenant. Stated differently, a first tenant can have a set of resources, resource capabilities, and/or resource capacities that is different from that of a second tenant. Service providers assign worker nodes to a tenant, and the tenant admin forms the clusters from the worker nodes.

The term “tenant administrator” or “tenant admin” or TA may refer to the entity responsible for managing an infrastructure assigned to a tenant. The tenant administrator is responsible for cluster management, project provisioning, providing user access to projects, application deployment, specifying new policies to be enforced in a tenant, etc.

The term “tenant cluster” may refer to clusters of resources assigned to each tenant upon which user workloads run. The domain cluster performs lifecycle management of the tenant clusters.

The term “virtual machine” may refer to a server abstracted from underlying computer hardware so as to enable a physical server to run multiple virtual machines or a single virtual machine that spans more than one server. Each virtual machine typically runs its own operating system instance to permit isolation of each application in its own virtual machine, reducing the chance that applications running on common underlying physical hardware will impact each other.

The term “volume” may refer to an ephemeral or persistent volume of memory of a selected size that is created from a distributed storage pool of memory. A volume may comprise a directory on disk and data or in another container and be associated with a volume driver. In some implementations, the volume is a virtual drive and multiple virtual drives can create multiple volumes. When a volume is created, a scheduler may automatically select an optimum node on which to create the volume. A “mirrored volume” refers to synchronous cluster-local data protection while a “replicated volume” refers to asynchronous cross-cluster data protection.

The term “worker node” may refer to the compute resources and network(s) that deploy, run, and manage containerized or VM-based applications. Each worker node contains the services to manage the networking between the containers, communicate with the master node, and assign resources to the scheduled containers. Each worker node can include a tool that is used to manage the containers, such as Docker, and a software agent called a Kubelet that receives and executes orders from the master node (e.g., the master API server). The Kubelet is a primary node agent which executes on each worker node inside the cluster. The Kubelet receives the pod specifications through an API server, executes the containers associated with the pods, and ensures that the containers described in the pods are running and healthy. If the Kubelet notices any issues with the pods running on the worker node, it tries to restart the pod on the same node; if the issue is with the worker node itself, the master node detects the node failure and decides to recreate the pods on another healthy node.

The preceding is a simplified summary of the invention to provide an understanding of some aspects of the invention. This summary is neither an extensive nor exhaustive overview of the invention and its various embodiments. It is intended neither to identify key or critical elements of the invention nor to delineate the scope of the invention but to present selected concepts of the invention in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other embodiments of the invention are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a cloud-based architecture according to an embodiment of this disclosure;

FIG. 2 is a block diagram of an embodiment of the application management server;

FIG. 3 is a block diagram of a cloud-based architecture according to an embodiment of this disclosure;

FIG. 4 is a block diagram of a volume storage schema according to an embodiment of this disclosure;

FIG. 5 depicts a snapshot mapping schema according to an embodiment of the disclosure;

FIG. 6A is a screenshot of a volume create request according to an embodiment of the disclosure;

FIG. 6B is a screenshot of a snapshot create request according to an embodiment of the disclosure;

FIG. 7A is a screenshot of a snapshot placement policy create request with one selected policy according to an embodiment of the disclosure;

FIG. 7B is a screenshot of a snapshot placement policy create request with multiple placement policies according to an embodiment of the disclosure;

FIG. 7C is a screenshot of a snapshot placement policy create request with a placement policy option according to an embodiment of the disclosure;

FIG. 8A is a screenshot of snapshot describe request according to an embodiment of the disclosure;

FIG. 8B is a screenshot of a snapshot placement policy describe request according to an embodiment of the disclosure;

FIG. 8C is a screenshot of a response to the snapshot placement policy describe request of FIG. 8B;

FIG. 9 is a flow chart depicting disaster recovery controller logic according to an embodiment of this disclosure; and

FIG. 10 is a flow chart depicting an authentication method according to an embodiment of this disclosure.

DETAILED DESCRIPTION

Overview

The present disclosure is directed to a multi-cloud platform that can provide a single management console from which customers manage cloud-native applications, clusters, and data using a policy-based management framework. The platform can be provided as a hosted service that is either managed centrally or deployed in customer environments. The customers could be enterprise customers or service providers. The platform can manage applications across multiple Kubernetes clusters, which could be residing on-premises or in the cloud or combinations thereof (e.g., hybrid cloud implementations). The platform can provide abstract core networking and storage services on premises and in the cloud for stateful and stateless applications.

The platform can migrate data (including application and snapshot volumes) and applications to any desired set of resources and provide failover of stateful applications from on premises to the cloud or within the cloud. As will be appreciated, an “application volume” refers to a volume in active use by an application while a “snapshot volume” refers to a volume that is imaged from an application volume at a specific point in time. The platform can provide instant snapshot volumes of containers or application volumes (e.g., mirroring or replicating of application or snapshot volumes), backup, and stateful or stateless application disaster recovery (DR) and data protection (DP).

The platform can automatically balance and adjust placement and lifecycle management of snapshot and application volumes depending on a user-specified policy. Exemplary policies applied within a selected cluster to select one or more worker nodes for mirrored snapshot and application volume placement include: limiting a number of mirrors for volumes (e.g., no mirror, two-way mirroring, as many mirrors as the volume-storing worker nodes, etc.), substantially minimizing storage space needed for a volume, substantially maximizing volume storage distribution across worker nodes, dedicating one or more worker nodes for all volume storage, distributing volume storage substantially evenly and/or uniformly across worker nodes, distributing volume storage giving more priority to worker nodes that have more space available, selecting the worker node storage resources used for creation and lifecycle management of parent or linked clone volumes so as to minimize “noisy neighbor” effects, i.e., maintain Service Level Agreements (SLAs) for applications or micro-services which are using the parent or application volume(s), selecting worker node storage resources so as to achieve a predetermined Recovery Point Objective (RPO) and/or Recovery Time Objective (RTO) for application recovery with the system automatically deciding on the placement and type of volume(s) to be created depending on the RPO and/or RTO to be achieved, and the like. Typically, RPO designates the variable amount of data that will be lost or will have to be re-entered during network downtime. RTO designates the amount of “real time” that can pass before the disruption begins to seriously and unacceptably impede the flow of normal business operations.
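By way of non-limiting illustration, two of the exemplary policies above (limiting the number of mirrors, and giving priority to worker nodes that have more space available) can be sketched together as follows. The function name place_mirrors, its parameters, and the capacity figures are hypothetical and serve only to make the policies concrete.

```python
def place_mirrors(free_gb_by_node, size_gb, mirrors):
    """Pick worker nodes for a mirrored volume, preferring nodes with more free space."""
    # Only nodes that can hold a full copy of the volume are eligible.
    eligible = [n for n, free in free_gb_by_node.items() if free >= size_gb]
    if len(eligible) < mirrors:
        raise ValueError("not enough worker nodes with sufficient free space")
    # Most free space first; node name breaks ties deterministically.
    eligible.sort(key=lambda n: (-free_gb_by_node[n], n))
    return eligible[:mirrors]


free = {"w1": 100, "w2": 400, "w3": 250, "w4": 20}
two_way = place_mirrors(free, size_gb=50, mirrors=2)
```

In this sketch a two-way mirror lands on the two nodes with the most free space, and a request for more mirrors than there are eligible nodes is rejected rather than silently reduced.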

These policies can be applied not only within a cluster but also across clusters. Exemplary policies applied across clusters to select a cluster for replicated snapshot or application volume placement include: a user policy specifying one or more clusters for storing volumes, limiting a number of mirrors across clusters for volume storage (e.g., no mirror, two-way mirroring, as many mirrors as the volume-storing clusters, etc.), substantially minimizing storage space needed across clusters for volume placement, substantially maximizing volume distribution across clusters, dedicating one or more clusters for storing all volumes, distributing volume storage substantially evenly and/or uniformly across clusters (e.g., giving more priority to clusters that have more storage space available), selecting, among multiple clusters, cluster storage resources used for creation and lifecycle management of parent or linked clone volumes to substantially minimize “noisy neighbor”, i.e. maintain SLAs for applications and/or micro-services which are using the parent or application volume(s), and selecting, among multiple clusters, cluster storage resources so as to achieve a predetermined RPO and/or RTO for application recovery with the system automatically deciding on the placement and type of volume(s) to be created depending on the RPO and/or RTO to be achieved.

As will be appreciated, other user-specified policies will be envisioned by one of ordinary skill in the art for management of clusters and/or worker nodes.

The platform can enable organizations to deliver a high-productivity Platform-as-a-Service (PaaS) that addresses multiple infrastructure-related and operations-related tasks and issues surrounding cloud-native development. It can support many container application platforms besides or in addition to Kubernetes, such as Red Hat OpenShift, Docker, and other Kubernetes distributions, whether hosted or on-premises.

While this disclosure is discussed with reference to the Kubernetes container platform, it is to be appreciated that the concepts disclosed herein apply to other container platforms, such as Microsoft Azure™, Amazon Web Services™ (AWS), Open Container Initiative (OCI), CoreOS, and Canonical (Ubuntu) LXD™.

The Multi-Cloud Platform

FIG. 1 depicts an embodiment of a multi-cloud platform according to the present disclosure. The multi-cloud platform 100 is in communication, via network 128, with one or more tenant clusters 132 a, . . . . Each tenant cluster 132 a, . . . corresponds to multiple tenants 136 a, b, . . . , with each of the multiple tenants 136 a, b, . . . in turn corresponding to a plurality of projects 140 a, b, . . . and worker node clusters 144 a, b, . . . . Each containerized or VM-based application 148 a, b, . . . n in each project 140 a, b, . . . utilizes the worker node resources in one or more of the clusters 144 a, b, . . . .

To manage the tenant clusters 132 a . . . the multi-cloud platform 100 is associated with a domain cluster 104 and comprises an application management server 108 and associated data storage 110 and master application programming interface (API) server 114, which is part of the master node (not shown) and associated data storage 112. The application management server 108 communicates with an application programming interface (API) server 152 assigned to the tenant clusters 132 a . . . to manage the associated tenant cluster 132 a . . . In some implementations, each cluster has a controller or control plane that is different from the application management server 108.

The servers 108 and 114 can each be implemented as a physical (or bare-metal) server or a cloud server. As will be appreciated, a cloud server is a physical and/or virtual infrastructure that performs application processing, information processing, and storage. Cloud servers are commonly created using virtualization software to divide a physical (bare metal) server into multiple virtual servers. The cloud server can use an infrastructure-as-a-service (IaaS) model to process workloads and store information.

The application management server 108 performs tenant cluster management using two management planes or levels, namely an infrastructure and application management layer 120 and stateful and application services layer 124. The stateful and application services layer 124 can abstract network and storage resources to provide global control and persistence, span on-premises and cloud resources, and provide intelligent placement of workloads based on logical data locality and block storage capacity. These layers are discussed in detail in connection with FIG. 2.

The API servers 114 and 152, which effectively act as gateways to the clusters, are commonly each implemented as a Kubernetes API server that implements a RESTful API over HTTP, performs all API operations, and is responsible for storing API objects into a persistent storage backend. Because all of the API server's persistent state is stored in storage that is external to the API server (one or both of the databases 110 and 112 in the case of master API server 114), the server itself is typically stateless and can be replicated to handle request load and provide fault tolerance. The API servers commonly provide API management (the process by which APIs are exposed and managed by the server), request processing (the target set of functionality that processes individual API requests from a client), and internal control loops (the internals responsible for background operations necessary to the successful operation of the API server).

In one implementation, the API server receives HTTPS requests from kubectl or any automation to send requests to any Kubernetes cluster. Users access the cluster using API server 152, which stores all the API objects in an etcd data store. As will be appreciated, etcd is a consistent and highly-available key value store used as Kubernetes' backing store for all cluster data. The master API server 114 receives HTTPS requests from a user interface (UI) or dmctl. This provides a single endpoint of contact for all UI functionality. The master API server 114 typically validates the request and sends the request to the API server 152. An agent controller (not shown) can reside on each tenant cluster and perform actions in each cluster. Domain cluster components can use Kubernetes native or CustomResourceDefinitions (CRD) objects to communicate with the API server 152 in the tenant cluster. The agent controller can handle the CRD objects.

In one implementation, the tenant clusters can run controllers such as an HNC controller, storage agent controller, or agent controller. Communication between domain cluster components and the tenant clusters is via the API server 152 on the tenant clusters. The applications on the domain cluster 104 can communicate with applications 148 on tenant clusters 144, and the applications 148 on one tenant cluster 144 can communicate with applications 148 on another tenant cluster 144 to implement specific functionality.

Data storage 110 is normally configured as a database and stores data structures necessary to implement the functions of the application management server 108. For example, data storage 110 comprises objects and associated definitions corresponding to each tenant cluster 144, and project and references to the associated cluster definitions in data storage 112. Other objects/definitions include networks and endpoints (for data networks), volumes (created from a distributed data storage pool on demand), mirrored volumes (created to have mirrored copies on one or more other nodes), snapshot volumes (a point-in-time image of a corresponding set of volume data), linked clones (volumes created from snapshot volumes are called linked clones of the parent volume and share data blocks with the corresponding snapshot volume until the linked clone blocks are modified), namespaces, access permissions and credentials, and other service-related objects.

Namespaces enable the use of multiple virtual clusters backed by a common physical cluster. The virtual clusters are defined by namespaces. Names of resources are unique within a namespace but not across namespaces. In this manner, namespaces allow division of cluster resources between multiple users. Namespaces are also used to manage access to application and service-related Kubernetes objects, such as pods, services, replication controllers, deployments, and other objects that are created in namespaces.
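By way of a non-limiting illustration, the namespace scoping described above can be sketched as a store keyed by a (namespace, name) pair. The class `NamespacedStore`, its methods, and the namespace names "team-a" and "team-b" are hypothetical and are not part of Kubernetes or of the disclosed system; the sketch only shows that a name must be unique within a namespace but may repeat across namespaces.

```python
class NamespacedStore:
    """Hypothetical object store in which names are scoped by namespace."""

    def __init__(self):
        # (namespace, name) -> object specification
        self._objects = {}

    def create(self, namespace, name, spec):
        key = (namespace, name)
        if key in self._objects:
            # The same name may not be reused inside one namespace.
            raise ValueError(f"{name!r} already exists in namespace {namespace!r}")
        self._objects[key] = spec

    def list(self, namespace):
        # Only objects in the requested namespace are visible.
        return sorted(name for ns, name in self._objects if ns == namespace)


store = NamespacedStore()
store.create("team-a", "db", {"kind": "Pod"})
store.create("team-b", "db", {"kind": "Pod"})  # same name, different namespace
```

Creating a second object named "db" in "team-a" would raise an error in this sketch, while the two objects above coexist because they live in different namespaces.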

Data storage 112 includes the data structures enabling cluster management by the master API server 114. In one implementation, data storage 112 is configured as a distributed key-value lightweight database, such as an etcd key value store. In Kubernetes, it is a central database for storing the current cluster state at any point in time and is also used to store configuration details such as subnets, configuration maps, etc.

The communication network 128, in some embodiments, can be any trusted or untrusted computer network, such as a WAN or LAN. The Internet is an example of the communication network 128 that constitutes an IP network consisting of many computers, computing networks, and other communication devices located all over the world. Other examples of the communication network 128 include, without limitation, an Integrated Services Digital Network (ISDN), the Public Switched Telephone Network (PSTN), a cellular network, and any other type of packet-switched or circuit-switched network known in the art. In some embodiments, the communication network 128 may be administered by a Mobile Network Operator (MNO). It should be appreciated that the communication network 128 need not be limited to any one network type, and instead may be comprised of a number of different networks and/or network types. Moreover, the communication network 128 may comprise a number of different communication media such as coaxial cable, copper cable/wire, fiber-optic cable, antennas for transmitting/receiving wireless messages, wireless access points, routers, and combinations thereof.

With reference now to FIG. 2, additional details of the application management server 108 will be described in accordance with embodiments of the present disclosure. The server 108 is shown to include processor(s) 204, memory 208, and communication interfaces 212 a . . . n. These resources may enable functionality of the server 108 as will be described herein.

The processor(s) 204 can correspond to one or many computer processing devices. For instance, the processor(s) 204 may be provided as silicon, as a Field Programmable Gate Array (FPGA), an Application-Specific Integrated Circuit (ASIC), any other type of Integrated Circuit (IC) chip, a collection of IC chips, or the like. As a more specific example, the processor(s) 204 may be provided as a microcontroller, microprocessor, Central Processing Unit (CPU), or plurality of microprocessors that are configured to execute the instructions sets stored in memory 208. Upon executing the instruction sets stored in memory 208, the processor(s) 204 enable various centralized management functions over the tenant clusters.

The memory 208 may include any type of computer memory device or collection of computer memory devices. The memory 208 may include volatile and/or non-volatile memory devices. Non-limiting examples of memory 208 include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Electronically-Erasable Programmable ROM (EEPROM), Dynamic RAM (DRAM), etc. The memory 208 may be configured to store the instruction sets depicted in addition to temporarily storing data for the processor(s) 204 to execute various types of routines or functions.

The communication interfaces 212 a . . . n may provide the server 108 with the ability to send and receive communication packets (e.g., requests) or the like over the network 128. The communication interfaces 212 a . . . n may be provided as a network interface card (NIC), a network port, drivers for the same, and the like. Communications between the components of the server 108 and other devices connected to the network 128 may all flow through the communication interfaces 212 a . . . n. In some embodiments, the communication interfaces 212 a . . . n may be provided in a single physical component or set of components, but may correspond to different communication channels (e.g., software-defined channels, frequency-defined channels, amplitude-defined channels, etc.) that are used to send/receive different communications to the master API server 114 or API server 152.

The illustrative instruction sets that may be stored in memory 208 include, without limitation, in the infrastructure and application management (management plane) 120, the project controller 216, data protection/disaster recovery controller 220, domain/tenant cluster controller 224, policy controller 228, tenant controller 232, and application controller 236 and, in the stateful data and application services (data plane) 124, distributed storage controller 244, networker controller 248, data protection (DP)/disaster recovery (DR) 252, logical and physical drives 256, container integration 260, and scheduler 264. Functions of the application management server 108 enabled by these various instruction sets are described below. Although not depicted, the memory 208 may include instructions that enable the processor(s) 204 to store data into and retrieve data from data storage 110 and 112.

It should be appreciated that the instruction sets depicted in FIG. 2 may be combined (partially or completely) with other instruction sets or may be further separated into additional and different instruction sets, depending upon configuration preferences for the server 108. Said another way, the particular instruction sets depicted in FIG. 2 should not be construed as limiting embodiments described herein.

In some embodiments, the instructions for the project controller 216, when executed by processor(s), may enable the server 108 to control, on a project-by-project basis, the resource utilization based on project members and control things such as authorization of resources within a project or across other projects using network access control list (ACL) policies. The project causes grouping of resources such as memory, CPU, storage and network and quota of these resources. The project members view or consume resources based on authorization policies. The projects could be on only one cluster or span across multiple or different clusters.

In some embodiments, instructions for the application mobility and disaster recovery controller 220 (at the management plane) and the data protection disaster recovery/DP 252 (at the data plane), when executed by processor(s), may enable the server 108 to implement containerized or VM-based application migration from one cluster to another cluster using migration agent controllers on individual clusters. The controller 252 could include a snapshot subcontroller to take periodic snapshots and a cluster health monitoring subcontroller running at the domain cluster to monitor health of containers.

In some embodiments, the instructions for the domain/tenant cluster controller 224, when executed by processor(s), may enable the server 108 to control provisioning of cloud-specific clusters and manage their native Kubernetes clusters. Other cluster operations that can be controlled include adopting an existing cluster, removing the cluster from the server 108, upgrading a cluster, creating the cluster, and destroying the cluster.

In some embodiments, instructions for the policy controller 228, when executed by the processor(s), may enable the server 108 to effect policy-based management, whose goal is to capture user intent via templates and enforce them declaratively for different applications, nodes, and clusters. An application may specify a policy for an application or snapshot volume storage. The policy controller can manage policy definitions and propagate them to individual clusters. The policy controller can interpret the policies and give the policy enforcement configuration to corresponding feature-specific controllers. The policy controller could be run at the tenant cluster or at the master node based on functionality. To implement a snapshot policy, the policy manager can determine application volumes that need snapshots and make the snapshot controller configuration, which will be propagated to the cluster hosting the volumes. The snapshot controller in the cluster will continue taking periodic snapshots as per the configuration requirements. In the case of a cluster health monitoring controller, the policy controller can send the configuration to that health monitoring controller on the domain cluster itself.

Other examples of policy control include application policy management (e.g., containerized or VM-based application placement, failover, migration, and dynamic resource management), storage policy management (e.g., storage policy management controls the snapshot policy, backup policy, replication policy, encryption policy, etc. for an application), network policy management, security policies, performance policies, access control lists, and policy updates.

In some embodiments, instructions for the application controller 236, when executed by the processor(s), may enable the server 108 to deploy applications, effect application failover/fallback, application cloning, and cluster cloning, and monitor applications. In one implementation, the application controller enables users to launch their applications from the server 108 on individual clusters or a set of clusters using a Kubectl command.

In some embodiments, the instructions for the distributed storage controller 244 and scheduler 264, when executed by processor(s), may enable the server 108 to perform storage configuration, management and operations such as storage migration/replication/backup/snapshots, etc. By way of example, the distributed storage controller 244 and scheduler 264 can work with the policy controller 228 to manage and apply snapshot and application volume placement policies, such as limiting a number of mirrors for application and snapshot volumes (e.g., no mirror, two-way mirroring, as many mirrors as the volume-storing worker nodes, etc.), minimizing storage space needed for a snapshot or application volume, maximizing snapshot or application volume storage distribution across clusters and/or worker nodes, dedicating one or more clusters and/or worker nodes for all snapshot and/or application volumes, distributing snapshot and application volume storage substantially evenly across clusters and/or worker nodes, distributing snapshot and application volume storage unevenly across clusters and/or worker nodes giving more priority to clusters and/or worker nodes that have more space available, selecting the cluster and/or worker node storage resources used for creation and lifecycle management of snapshot and application volumes so as to maintain SLAs for applications or micro-services which are using the parent volume(s), selecting cluster and/or worker node storage resources so as to achieve a certain RPO and/or RTO for recovery with the distributed storage controller automatically deciding on the placement and type of snapshot volume(s) to be created depending on the RPO/RTO to be achieved, and the like. These policies can be applied not only by the stateful data and application services or data plane layer 124 within a cluster but also by the infrastructure and application management or management plane across clusters.

In some embodiments, the instructions for the networker controller 248, when executed by processor(s), may enable the server 108 to enable multi-cluster or container networker (particularly at the data link and network layers) in which services or applications run mostly on one cluster and, for high availability reasons, use another cluster either on premises or on the public cloud. The service or application can migrate to other clusters upon user request or for other reasons. In most implementations, services run in one cluster at a time. The networker controller 248 can also enable services to use different clusters simultaneously and enable communication across the clusters. The networker controller 248 can attach one or more interfaces (programmed to have a specific performance configuration) to a selected container while maintaining isolation between management and data networks. This can be done by each container having the ability to request one or more interfaces on specified data networks.

In some embodiments, the instructions for the logical drives 408 a-n, when executed by processor(s), may enable the server 108 to provide a common API (via the Container Network Interface) for connecting containers to an external network and expose (via the Container Storage Interface) arbitrary block and file storage systems to containerized or VM-based workloads. In some implementations, CSI can expose arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs), such as Kubernetes and AWS.

In some embodiments, the instructions for the container integration 260, when executed by processor(s), may enable the server 108 to provide (via OpenShift) a cloud-based container platform that is both containerization software and a platform-as-a-service (PaaS).

FIG. 3 illustrates the operations of the scheduler 264 and distributed storage controller 244 in more detail. The application server 108 is in communication, via network 128, with a plurality of worker nodes 300 a-n. While FIG. 3 depicts the master API server separate from the worker nodes, in some implementations the same node can act as both a master and worker node.

The database 112 is depicted as an “/etc distributed” or etcd key value store that stores physical data as key-value pairs in a persistent b+tree. Each revision of the etcd key value store's state typically contains only the delta from a previous revision for storage efficiency. A single revision may correspond to multiple keys in the tree. The key of a key-value pair is a 3-tuple (major, sub, type). The database 112, in this implementation, stores the entire state of a cluster: that is, it stores the cluster's configuration, specifications, and the statuses of the running workloads. In Kubernetes in particular, etcd's “watch” function monitors the data and reconfigures itself when changes occur.

The worker nodes 300 a-n can be part of a common cluster or different clusters 144, the same or different projects 140, and/or the same or different tenant clusters 132, depending on the implementation. The worker nodes 300 comprise the compute resources, drives on which volumes are created for applications, and network(s) that deploy, run, and manage containerized or VM-based applications. For example, a first worker node 300 a comprises an application 148 a, a node agent 304, and a database 308 containing storage resources. The node agent 304, or Kubelet in Kubernetes, runs on each worker node and ensures that all containers are running and healthy in a pod and makes any configuration changes on the worker nodes. The database 308 or other data storage resource corresponds to the pod associated with the worker node (e.g., the database 308 for the first worker node 300 a is identified as “P0” for pod 0, the database 308 for the second worker node 300 b is identified as “P1” for pod 1, and the database 308 for the nth worker node 300 n is identified as “P2” for pod 2). Each database 308 in the first and second worker nodes 300 a and b is shown to include a volume associated with respective application 148 a and b. The volume in the nth worker node 300 n, depending on the implementation, could be associated with either of the applications 148 a or b. As will be appreciated, an application's volume can be divided among the storage resources of multiple worker nodes and is not limited to the storage resources of the worker node running the application.

The master API server 114, in response to user requests to instantiate an application or create an application or snapshot 312 volume, records the request in the etcd database 112, and, in response, the scheduler 264 determines on which database(s) 308 the volume should be created in accordance with placement policies specified by the policy controller 228. For example, the placement policy can select the worker node that has the least amount of storage resources consumed at that point, that is required for optimal operation of the selected application 148, or that is selected by the user.

FIG. 4 depicts volume layout. A snapshot volume 400 is divided into subvolumes SV₁, SV₂, . . . SV_(n) 404 to match blocks reserved on the physical drive or the database 308, each of which is associated with a logical drive 256 a-n, which then associates the respective subvolume with a selected physical address on a database 308 for the corresponding block, where data in the corresponding subvolume is stored.

FIG. 5 further illustrates mapping of subvolumes for discrete snapshot volumes-1 and -2 and the current logical drive (LD) 408 onto physical drives in database 308 in a hyper-converged infrastructure. As will be appreciated, a logical drive or LD typically represents the smallest logical object in a volume layout. Each worker node has one or more drives for creating volumes for each application. The current LD represents the current state or image of the volume in use by the application while the snapshot volumes-1 and -2 represent prior states or images of the volume at specific points in time. Though any number of maps may be employed, an L0 and L1 map are typically used to map volumes onto the physical drives and enable an application to locate on the physical drive the particular block containing selected data. The map 500 for the snapshot volume-1 comprises a logical subvolume address 504 and, for each logical subvolume address 504, a corresponding block address 508 on a physical drive. Thus, the first logical subvolume address, as shown by the arrow, maps to a physical drive address “pblk-a”. The same relationship exists for the snapshot volume-2 map 512 and current LD map 516.

In some embodiments, the distributed storage controller 244 creates a storage space that optimizes storage for the snapshot volume to be created. If only a few gigabytes have been written in the volume, the distributed storage controller 244 allocates blocks only for the portion of the volume that has been written as of that point in time. When the distributed storage controller 244 creates a snapshot volume, it reserves some space for that snapshot volume on the selected worker node. At that point in time, the distributed storage controller 244 determines the usage of the volume on that particular worker node and how much space is required for the snapshot volume to be created. If there is already a prior snapshot volume stored on the selected worker node and the distributed storage controller 244 desires to create a second snapshot volume, the distributed storage controller 244 first determines the adequate storage space, or storage space required, for creating the second snapshot volume given that there is already a prior snapshot volume stored on the worker node. The distributed storage controller 244 determines the differences, updates, or changes made to the volume since the prior snapshot volume was created and allocates the storage space required only for storing the differences, updates, or changes on the selected worker node. Stated simply, the distributed storage controller 244 tries to create or reserve only that space needed for the second snapshot volume, though the volume itself could be huge. In this manner, the distributed storage controller 244 can create multiple snapshot volumes on the selected worker node even for larger volumes.

By way of example, assume that the distributed storage controller 244 desires to create a 100 gigabyte volume to store data. The distributed storage controller 244 does not immediately assign blocks for all 100 gigabytes of data but determines how many blocks are required to store the data and locates the available blocks on the worker node drives. The distributed storage controller 244 then maps each block to the corresponding physical address on the drive. Each logical block has a corresponding physical block on the physical drive or storage media. The blocks do not need to be contiguous but can be located anywhere. The map enables the application to determine where each and every block is located so that it can collect and assemble the blocks as needed.
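By way of a non-limiting illustration, the on-demand block allocation described above can be sketched as a map from logical block numbers to physical block addresses that is populated only when a block is first written. The class `ThinVolume`, the dictionary standing in for a drive, and the physical addresses 907, 12, and 455 are hypothetical and chosen only to show that the allocated blocks need not be contiguous.

```python
class ThinVolume:
    """Hypothetical volume whose blocks are mapped only when first written."""

    def __init__(self, size_blocks, free_physical_blocks):
        self.size_blocks = size_blocks
        # Free physical addresses on the worker node drive (not contiguous).
        self.free = list(free_physical_blocks)
        # Logical block number -> physical block address.
        self.block_map = {}

    def write(self, logical_block, data, drive):
        if not 0 <= logical_block < self.size_blocks:
            raise IndexError("logical block outside volume")
        if logical_block not in self.block_map:
            if not self.free:
                raise RuntimeError("no free physical blocks")
            # Allocate a physical block on first write only.
            self.block_map[logical_block] = self.free.pop(0)
        drive[self.block_map[logical_block]] = data

    def read(self, logical_block, drive):
        # The map tells the caller where the block lives on the drive.
        phys = self.block_map.get(logical_block)
        return None if phys is None else drive[phys]


drive = {}  # physical block address -> data, standing in for a physical drive
vol = ThinVolume(size_blocks=1000, free_physical_blocks=[907, 12, 455])
vol.write(0, b"alpha", drive)
vol.write(999, b"omega", drive)
```

In this sketch a 1000-block volume consumes only two physical blocks after two writes, and unwritten logical blocks have no physical counterpart.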

The operation of the distributed storage controller 244 is further illustrated by FIG. 5. FIG. 5 illustrates that each snapshot volume is space-optimized and shares unmodified blocks with, or is a linked clone of, the parent volume and, if applicable, a preceding snapshot volume. The selected first subvolume 504, as shown by the arrow, maps to a block at physical drive address “pblk-a” 508 at the time of the first snapshot. When the second snapshot is taken, the same first subvolume 504 has changed and now maps to a different block at a different physical drive address “pblk-x” 508. The current LD map 516 shows that the first subvolume 504 has not changed since the second snapshot and still maps to the same block at physical drive address “pblk-x” 508. In contrast, an nth subvolume 504 at the time of the first and second snapshots maps to a common block at physical drive address “pblk-n” 508. However, the nth subvolume in the current LD map has changed and now maps to a new block at a different physical drive address “pblk-nn” 508. In other words, only blocks that have changed since a preceding snapshot volume are given new physical drive addresses in the later snapshot volume; the blocks that have not changed retain the same physical drive addresses as in the earlier snapshot volume. In this way, the storage required for a later snapshot volume is reduced.
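By way of a non-limiting illustration, the three maps of FIG. 5 can be sketched as dictionaries from a logical subvolume address to a physical block address. The physical addresses are those named above; the logical keys "sv1" and "svn" and the helper functions `changed_blocks` and `physical_blocks_in_use` are hypothetical names introduced only for this sketch.

```python
# Logical subvolume address -> physical block address, following FIG. 5.
snapshot_1 = {"sv1": "pblk-a", "svn": "pblk-n"}
snapshot_2 = {"sv1": "pblk-x", "svn": "pblk-n"}
current_ld = {"sv1": "pblk-x", "svn": "pblk-nn"}


def changed_blocks(earlier, later):
    """Logical addresses whose physical block differs between two maps."""
    return {addr for addr, pblk in later.items() if earlier.get(addr) != pblk}


def physical_blocks_in_use(*maps):
    """Distinct physical blocks referenced by the maps; shared blocks count once."""
    return set().union(*(m.values() for m in maps))


delta_1_to_2 = changed_blocks(snapshot_1, snapshot_2)
delta_2_to_current = changed_blocks(snapshot_2, current_ld)
total_blocks = physical_blocks_in_use(snapshot_1, snapshot_2, current_ld)
```

Here only the first subvolume differs between the two snapshots and only the nth subvolume differs between the second snapshot and the current LD, so the three images together reference four physical blocks rather than six.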

FIGS. 6A-6B, 7A-7C, and 8A-8C depict various user commands.

FIG. 6A depicts a volume create request 600 that creates a volume with three different mirrors on three different worker nodes in a selected cluster. The request includes the command to create 3 mirrored volumes (m3) and additional fields for volume name, size, node identity, labels, phase, and status. A volume describe request (not shown) describes the three worker nodes on which the volume is created and contains fields including volume name, volume size, worker node name or identity, phase, attached to, device path, performance tier (e.g., best effort), and scheduled plexes/actual plexes, and, for each plex, the further fields of volume name and, for each volume name, worker node name, state, condition, out-of-syn-age, and resync-progress.

FIG. 6B depicts a snapshot create request 650 that creates a (typically read-only) LD snapshot volume of an application volume's LDs. The snapshot create request 650 creates a snapshot volume for an application volume that already exists in the system. The request does not specify any placement policy, scheduler algorithms, or mirror count, thereby permitting the snapshot volume to be created on any one of the available worker nodes where space exists. It will be created, by default, with a single mirror. The request makes the snapshot volume with the same layout as the parent application volume. Snapshot logical drives are created by copying L0 blocks from the parent volume logical drives, and the ownership of L1 blocks is transferred to the snapshot volume. The snapshot volume shares layout L1 and physical data blocks with the application volume and is therefore a linked clone of the application volume. The snapshot volume owns the L1 block with owner=1; the parent application volume shares the L1 block with owner=0 in its L0 block. Block ownership information is maintained at two levels, the first level being the L1 block level in the L0 block and the second level being the physical block level in the L1 block. The request 650 includes the command to create a snapshot of the selected “src vol1” and the following additional fields for snapshot volume name, snapshot volume size, node identity, labels, parent volume name, attached-to, device-path, phase, and age.

FIG. 8A depicts a snapshot describe request 800 for a specified snapshot volume. The request includes the following fields: snapshot volume name (e.g., “snap1”), size (21.51 GB), encryption (true or false), worker node name (“appserv19”), node selector (none selected), phase (available), status (available), age (27 seconds), scheduled plexes/actual plexes (1/1), and parent volume name (“vol1”). The request further includes the following fields for related plexes: plex name (“snap1.p0”), node name (“appserv19”), and state (up).

As shown by FIG. 7C, the snapshot create request 780 can include an option prefix to create a snapshot volume using policy-based management. The request 780 requests creation of a snapshot volume of “source vol1” using a placement policy (identified as custom).

Policies can be created based on assigning weight to one or more scheduler algorithms in connection with snapshot or application volume mirroring (or synchronous replication). In creating a placement policy, the user can select one or more scheduler algorithms from a list of scheduler algorithms and, for each selected scheduler algorithm, assign a weight.

By way of illustration, FIG. 7A depicts a snapshot placement policy create request 700 that creates a selected snapshot placement policy, e.g., storage optimized. FIG. 7A also depicts a snapshot placement policy describe storage request 720 that provides detailed information about a requested snapshot placement policy (e.g., the storage optimized placement policy). As shown in FIG. 7A, the storage optimized placement policy assigns a weight of “1” to the BalanceStorageResourceUsage scheduler algorithm and “0” to the other scheduler algorithms. In another example shown in FIGS. 8B and 8C, the snapshot placement policy describe request 840 describes a default placement policy, which as shown by response 860 assigns a weight of “1” to the LeastRecentlyUsed scheduler algorithm but “0” to each of the BalanceStorageResourceUsage, SkipInitiator, MostFrequentlyUsed, SelectorBased, and AllPlexes scheduler algorithms.

While the requests 700 and 840 depict placement policies using only a single scheduler algorithm for volume mirroring, FIG. 7B depicts a snapshot placement policy create request 740 to create a custom placement policy for mirroring with the user assigning selected weights to multiple scheduler algorithms. In the request 740, the user has assigned a weight of “1” to the BalanceStorageResourceUsage scheduler algorithm and a weight of “10” to the SkipInitiator scheduler algorithm. As shown by the response 760 to the request, the user has, by implication, assigned a weight of “0” to the remaining LeastRecentlyUsed, MostFrequentlyUsed, SelectorBased, and AllPlexes scheduler algorithms.
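By way of a non-limiting illustration, one way the weighted outputs of multiple scheduler algorithms could be combined is a weighted sum of per-node scores. The disclosure does not fix a particular combining formula, so the weighted sum, the 0-to-1 score range, the node records, and the function names `balance_storage`, `skip_initiator`, and `place` are all assumptions made for this sketch; only the algorithm names and the weights "1" and "10" come from the request 740.

```python
def balance_storage(nodes, ctx):
    """Assumed scoring: the less storage a node has in use, the higher its score."""
    most_used = max(n["used_gb"] for n in nodes) or 1
    return {n["name"]: 1 - n["used_gb"] / most_used for n in nodes}


def skip_initiator(nodes, ctx):
    """Assumed scoring: 0 for the node running the application, 1 otherwise."""
    return {n["name"]: 0.0 if n["name"] == ctx["initiator"] else 1.0 for n in nodes}


ALGORITHMS = {
    "BalanceStorageResourceUsage": balance_storage,
    "SkipInitiator": skip_initiator,
}


def place(nodes, policy, ctx):
    """Sum weight * score over the policy's algorithms and pick the best node."""
    totals = {n["name"]: 0.0 for n in nodes}
    for algorithm, weight in policy.items():
        if weight == 0:
            continue  # zero-weighted algorithms do not influence placement
        for node, score in ALGORITHMS[algorithm](nodes, ctx).items():
            totals[node] += weight * score
    return max(totals, key=totals.get), totals


nodes = [
    {"name": "node1", "used_gb": 100},
    {"name": "node2", "used_gb": 400},
    {"name": "node3", "used_gb": 800},
]
ctx = {"initiator": "node1"}
custom = {"BalanceStorageResourceUsage": 1, "SkipInitiator": 10, "LeastRecentlyUsed": 0}
storage_optimized = {"BalanceStorageResourceUsage": 1}

custom_choice, custom_totals = place(nodes, custom, ctx)
optimized_choice, _ = place(nodes, storage_optimized, ctx)
```

Under these assumptions the storage optimized policy picks the emptiest node (node1), while the custom policy's heavier SkipInitiator weight steers placement away from the initiator node to the emptiest remaining node (node2).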

The BalanceStorageResourceUsage scheduler algorithm compares the memory storage in use for each worker node of a selected set of worker nodes and selects for snapshot or application volume storage the worker node that has the least amount of storage in use at that point in time. Stated differently, the algorithm compares the memory not in use for each worker node of a selected set of worker nodes and selects for snapshot or application volume storage the worker node having the greatest amount of memory not in use (or highest amount of free storage). Assume worker node 1 has one terabyte of storage in use while worker nodes 2 and 3 each have 100 gigabytes of storage in use. When the algorithm attempts to balance storage resource usage, the snapshot volume will not be created on worker node 1 because it has the maximum or highest storage in use when compared to the other worker nodes. The algorithm attempts to balance between worker nodes 2 and 3. In other words, a first snapshot volume will be stored on worker node 2, and a second, later snapshot volume will be stored on worker node 3. In this manner, the algorithm performs a round-robin between the worker nodes 2 and 3 with respect to later snapshot volume storage. Worker node 1 is not scored high until the storage resource usage becomes balanced across all the worker nodes in the selected cluster(s). The scoring will be based on the current storage resource consumption across the selected set of worker nodes.
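By way of a non-limiting illustration, the example above can be simulated by repeatedly selecting the worker node with the least storage in use. The storage figures come from the example; the function name `pick_least_used`, the tie-break by node name, and the 10 gigabyte snapshot size are assumptions made only for this sketch.

```python
def pick_least_used(used_gb):
    """Return the node with the least storage in use (ties broken by name)."""
    return min(sorted(used_gb), key=lambda node: used_gb[node])


# Worker node 1 has one terabyte in use; worker nodes 2 and 3 have 100 GB each.
used_gb = {"node1": 1000, "node2": 100, "node3": 100}

placements = []
for _ in range(4):
    node = pick_least_used(used_gb)
    used_gb[node] += 10  # assumed size of each snapshot volume, in GB
    placements.append(node)
```

In this sketch the four snapshot volumes alternate between worker nodes 2 and 3, and worker node 1 is never selected because its usage remains the highest.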

The SkipInitiator scheduler algorithm skips the worker node(s) executing the application from consideration as a location for snapshot or application volume storage. The application associated with the application volume of which the snapshot is to be taken runs on a worker node known as the initiator node. The user can decide not to create the snapshot volume on the initiator node. For example, the user can try to create the snapshot volume for a backup workflow, and the user does not want the application to be backed up on the worker node on which the application is running because a malfunction of the initiator node will impact both the application and the snapshot volume to be used in disaster recovery and restarting the application. This algorithm enables the user to skip the initiator node at all points in time when creating the snapshot or application volume. A specified snapshot will thus always skip the initiator node when creating this snapshot volume. As will be appreciated, a snapshot volume is used for multiple purposes, including data protection, which is available locally on the node. Snapshot volumes can also be used to take a backup of an application's data from some cluster to store backups on the cloud. A backup application can be run to take locally stored snapshot volume data and copy it onto the cloud. When a user runs a backup application, this algorithm treats it as a high-priority application that should not be impacted by backup snapshot volume data being collocated on a common worker node with the backup application.

The LeastRecentlyUsed scheduler algorithm selects, as the snapshot or application volume storage location, the worker node of a set of worker nodes that is least recently used for snapshot or application volume storage, respectively (e.g., not used within a predetermined time period or used longest ago when compared to other worker nodes in the cluster(s)). For instance, if a first snapshot volume were created on worker node 1 in the set of worker nodes, a second snapshot volume will not be stored on worker node 1. When an application volume but no prior snapshot volume exists on worker nodes 2 and 3 in the set of worker nodes, the algorithm would select the least recently used worker node for creating the snapshot volume so that worker nodes 2 and 3 would have a higher score or likelihood of being selected for storing the second snapshot volume than worker node 1. Conversely, when both worker nodes 2 and 3 in the set of worker nodes have been used for snapshot volume storage at a time before worker node 1, the algorithm would select the least recently used worker node for creating the snapshot volume so that the least recently used of worker nodes 2 and 3 for snapshot volume storage would have a higher score or likelihood of being selected for storing the second snapshot volume than worker node 1.
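By way of a non-limiting illustration, least-recently-used selection can be sketched by recording, for each worker node, the time of its last snapshot placement. The function name `least_recently_used`, the integer timestamps, the use of `None` for a node never used for snapshot storage, and the tie-break by node name are assumptions made only for this sketch.

```python
def least_recently_used(last_used):
    """Pick the node whose last placement is oldest; never-used nodes come first."""
    return min(
        sorted(last_used),
        key=lambda node: (last_used[node] is not None, last_used[node] or 0),
    )


# Worker node 1 stored the first snapshot at time 30; nodes 2 and 3 never have.
last_used = {"node1": 30, "node2": None, "node3": None}

picks = []
for now in (40, 50, 60):
    node = least_recently_used(last_used)
    last_used[node] = now  # record the placement time
    picks.append(node)
```

In this sketch the second and third snapshot volumes go to worker nodes 2 and 3, and worker node 1 is selected again only once it has become the least recently used node.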

The MostFrequentlyUsed scheduler algorithm selects, for snapshot or application volume storage, the worker node of a set of worker nodes that was used last (or most recently) for creating a snapshot or application volume, respectively. If worker node 1 was picked last for a snapshot volume placement, the algorithm tries to always pick worker node 1 for later snapshot volume placement, until worker node 1 runs out of memory space for creating snapshot volumes, in which case the algorithm moves to a next worker node in the set of worker nodes.
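The "stick with the last node until it is full" behavior can be sketched as below. The disclosure does not state how the "next" worker node is chosen, so this sketch assumes, purely for illustration, that nodes are tried in order of recency; the function name `pick_most_recently_used` and the byte-based capacity check are also hypothetical.

```python
def pick_most_recently_used(nodes_by_recency, free_bytes, required_bytes):
    """Return the most recently used node that still has room for the snapshot.

    nodes_by_recency lists node names with the most recently used node first.
    free_bytes maps a node name to its available storage (assumed metric).
    """
    for node in nodes_by_recency:
        if free_bytes.get(node, 0) >= required_bytes:
            return node
    return None  # no node in the set has enough space


# node-1 was picked last and still has room, so it is picked again.
first_choice = pick_most_recently_used(
    ["node-1", "node-2"], {"node-1": 500, "node-2": 500}, required_bytes=100
)
# node-1 has run out of space, so the algorithm moves on to node-2.
fallback_choice = pick_most_recently_used(
    ["node-1", "node-2"], {"node-1": 10, "node-2": 500}, required_bytes=100
)
print(first_choice, fallback_choice)  # node-1 node-2
```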

The AllPlexes scheduler algorithm could be characterized as “all mirrors,” as it creates a snapshot volume on all the worker nodes in the set of worker nodes where the selected application volume's mirrors exist. The user typically selects this scheduler algorithm when the user desires the snapshot volume to be highly available and be available on all worker nodes on the system.

A round robin scheduler algorithm (not shown) stores snapshot or application volumes among the worker nodes in a set of worker nodes on a rotating basis in a specified order. The specified order can be, for instance, a circular order in which each worker node in the set of worker nodes receives one snapshot or application volume per cycle. The round robin algorithm can include round robin algorithm variants, such as weighted round robin (e.g., classical weighted round robin, interleaved weighted round robin, etc.) and deficit round robin or deficit weighted round robin.
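A plain (unweighted) circular order can be sketched with the Python standard library. The class name `RoundRobinPlacer` and the node names are assumptions for illustration; the weighted and deficit variants mentioned above are not shown.

```python
from itertools import cycle


class RoundRobinPlacer:
    """Hand out worker nodes in a fixed circular order, one volume per turn."""

    def __init__(self, worker_nodes):
        # Copy the node list so later changes by the caller do not affect the order.
        self._order = cycle(list(worker_nodes))

    def next_node(self):
        return next(self._order)


placer = RoundRobinPlacer(["node-1", "node-2", "node-3"])
placements = [placer.next_node() for _ in range(5)]
print(placements)  # ['node-1', 'node-2', 'node-3', 'node-1', 'node-2']
```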

The SelectorBased scheduler algorithm gives the user the option to specify the worker nodes on which the user wants the snapshot or application volume created. For example, assume there are three nodes 1, 2, and 3 in the set of worker nodes. This algorithm enables the user to specify for volume one that he or she wants all snapshot volumes only on worker node 1 because worker node 1 has high availability. If the user specifies a worker node, the highest policy placement weight is assigned to that node.

As will be appreciated, other scheduling algorithms may be used for snapshot or application volume mirroring depending on the application.

In formulating a custom placement policy within a selected cluster 144, the scheduler algorithms can be used alone or together, as determined by the user's assigned weights to the algorithms. A weight of “0” causes the associated scheduler algorithm not to be executed in selecting one or more worker nodes for snapshot or application volume placement. The relative weights of the weighted scheduler algorithms determine the relative priorities of the weighted scheduler algorithms in worker node selection. For example, assume a user, in defining a custom placement policy, selects two scheduler algorithms having different weights. Per the policy, the scheduler first attempts to apply both scheduler algorithms consistently and, when they produce conflicting or inconsistent placement selections, uses the weights as an arbiter. Each worker node is assigned a score based on the aggregate output of both scheduler algorithms. While any scoring algorithm can be employed, one approach uses each of the scheduler algorithms to assign a score to each worker node, and the score is then multiplied by the corresponding algorithm's weight to produce a weight adjusted score. This is done for each scheduler algorithm, and the weight adjusted scores are summed for each worker node. The worker node(s) having the highest total scores are selected for snapshot or application volume storage. If the user has a hard storage placement requirement, then the user can choose only one scheduler algorithm or, alternatively, assign the scheduler algorithm that preferentially selects that worker node the highest weight. This can provide the user with substantial flexibility and simplicity in formulating custom snapshot and application placement policies.

To illustrate how worker nodes are selected using multiple weighted scheduler algorithms, a first example will be discussed with reference to FIG. 7B. The user has assigned a first weight of “1” to the BalanceStorageResourceUsage scheduler algorithm and a second weight of “10” to the SkipInitiator scheduler algorithm. If BalanceStorageResourceUsage scheduler algorithm selects first and second worker nodes, each is assigned a score of “1”. If SkipInitiator scheduler algorithm selects the first worker node and a third worker node, each is assigned a score of “1”. The total weighted score for the first worker node is “11” (the sum of (10)×(1) and (1)×(1)), for the second worker node is “1”, and for the third worker node is “10”. The custom placement policy would thus cause the scheduler to select the first worker node as the placement location for a snapshot or application volume.
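The arithmetic of this first example can be reproduced with a short sketch. The function name `weighted_totals` and the node labels are assumptions for illustration, and the sketch further assumes that a worker node not selected by an algorithm receives a score of 0 from that algorithm, which is consistent with the totals stated above.

```python
def weighted_totals(selections, weights):
    """Sum weight * score per node, where every selected node scores 1.

    selections maps an algorithm name to the nodes it selected.
    weights maps an algorithm name to its user-assigned weight.
    """
    totals = {}
    for algorithm, nodes in selections.items():
        weight = weights.get(algorithm, 0)
        if weight == 0:
            continue  # a weight of 0 means the algorithm is not applied
        for node in nodes:
            totals[node] = totals.get(node, 0) + weight * 1
    return totals


weights = {"BalanceStorageResourceUsage": 1, "SkipInitiator": 10}
selections = {
    "BalanceStorageResourceUsage": ["first", "second"],
    "SkipInitiator": ["first", "third"],
}
totals = weighted_totals(selections, weights)
best = max(totals, key=totals.get)
print(totals)  # {'first': 11, 'second': 1, 'third': 10}
print(best)    # first
```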

The policy can be further modified by a user rule that any worker node having a weighted score of a minimum amount is selected as the placement location for a snapshot or application volume. In the prior example, a user rule setting the minimum amount as “10” would cause the scheduler to select the first and third worker nodes as the placement location for the snapshot or application volume.
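The minimum-score rule amounts to a simple filter over the total weighted scores. In the sketch below, the function name `select_by_threshold` is hypothetical, and the score values are the totals from the prior example.

```python
def select_by_threshold(totals, minimum):
    """Return every node whose total weighted score is at least `minimum`."""
    return sorted(node for node, score in totals.items() if score >= minimum)


# Total weighted scores from the prior example.
example_totals = {"first": 11, "second": 1, "third": 10}
chosen = select_by_threshold(example_totals, minimum=10)
print(chosen)  # ['first', 'third']
```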

The placement policy can also be applied at the project 140 and/or cluster 144 level to select one or more clusters from among a plurality of clusters for snapshot or application volume asynchronous replication. When applied at the project 140 and/or cluster 144 level, the placement policy can be used to select, for a respective tenant and project, one or more clusters for snapshot or application volume placement by treating each cluster as a worker node in the description above.

As will be appreciated, there can be tiers or levels of placement policies. With reference to the prior paragraph, the scheduler algorithm could apply a first placement policy to select, for a given tenant and project, one or more clusters 144 from among plural clusters for snapshot or application volume asynchronous replication, and a second placement policy to select, within a cluster, one or more worker nodes from among plural worker nodes for the snapshot or application volume asynchronous replication. Apart from the object of the placement policies, the policies can be the same or different. For example, a BalanceStorageResourceUsage scheduler algorithm can be used at the tenant and project level to select one or more clusters and the SkipInitiator scheduler algorithm can be used at the cluster level to select one or more worker nodes within the one or more clusters selected by the BalanceStorageResourceUsage scheduler algorithm. Alternatively, for snapshot or application volume placement a first custom policy can assign a first set of weights to the BalanceStorageResourceUsage and SkipInitiator scheduler algorithms to select, at the tenant and project levels, one or more clusters from among plural clusters and a second custom placement policy assigning a different second set of weights to the BalanceStorageResourceUsage and SkipInitiator scheduler algorithms can be used at the cluster level to select one or more worker nodes from among plural worker nodes within the one or more clusters selected by the first custom placement policy.
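The two-tier arrangement, in which the same scheduler algorithms carry different weights at each level, can be sketched as follows. All names (`best_target`, `tiered_placement`), the per-algorithm scores, and the weight values are hypothetical and chosen only to show the cluster-then-node order of selection.

```python
def best_target(scores_by_algorithm, weights):
    """Return the target (cluster or node) with the highest total weighted score."""
    totals = {}
    for algorithm, scores in scores_by_algorithm.items():
        for target, score in scores.items():
            totals[target] = totals.get(target, 0) + weights.get(algorithm, 0) * score
    return max(totals, key=totals.get)


def tiered_placement(cluster_scores, cluster_weights, node_scores, node_weights):
    """Apply a first policy across clusters, then a second policy within the winner."""
    cluster = best_target(cluster_scores, cluster_weights)
    node = best_target(node_scores[cluster], node_weights)
    return cluster, node


# First policy (tenant and project level): favors storage balance.
cluster_weights = {"BalanceStorageResourceUsage": 5, "SkipInitiator": 1}
cluster_scores = {
    "BalanceStorageResourceUsage": {"cluster-a": 1, "cluster-b": 0},
    "SkipInitiator": {"cluster-a": 0, "cluster-b": 1},
}
# Second policy (cluster level): favors skipping the initiator node.
node_weights = {"BalanceStorageResourceUsage": 1, "SkipInitiator": 10}
node_scores = {
    "cluster-a": {
        "BalanceStorageResourceUsage": {"node-1": 1, "node-2": 0},
        "SkipInitiator": {"node-1": 0, "node-2": 1},
    },
    "cluster-b": {
        "BalanceStorageResourceUsage": {"node-3": 1},
        "SkipInitiator": {"node-3": 1},
    },
}
placement = tiered_placement(cluster_scores, cluster_weights, node_scores, node_weights)
print(placement)  # ('cluster-a', 'node-2')
```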

Method of Operation of the Multi-Cloud Platform

With reference to FIGS. 3 and 9-10, an embodiment of a process to select a worker node within a cluster for snapshot volume placement will be discussed. While the process is discussed with reference to snapshot volume placement, it is to be understood that it could also be applied to selecting a cluster and/or worker node for application volume mirroring. While the process is discussed with reference to selecting a worker node within a cluster for snapshot volume placement, it is to be understood that it could also be applied to selecting a cluster from among multiple clusters for snapshot volume mirroring.

With reference to FIG. 9, the master API server 114 receives from a user (e.g., issued by the user through a Command Line Interface (CLI)) a snapshot create request for a specified application's volume (step 904). An exemplary snapshot creation request is depicted in FIG. 6B.

In step 908, the master API server 114 records the snapshot creation request in the etcd database 112 and sets the snapshot creation request to the pending state.

In step 912, the scheduler 264, in response to the snapshot creation request in the pending state, identifies relevant snapshot placement policy(s) previously selected by the user. The scheduler 264 watches the etcd database 112 for any snapshot requests that are in the pending state.

In decision diamond 916, the scheduler 264 determines whether the relevant policy(s) stipulate more than one scheduler algorithm applies.

When the relevant policy(s) specify that more than one scheduler algorithm applies, the scheduler 264, in step 920, determines, for each selected or weighted scheduler algorithm, the worker nodes specified by the algorithm.

In step 924, the scheduler 264 assigns a weighted score or ranking to each worker node as noted above.

In step 928, the scheduler 264 selects the worker node(s) having the highest or (depending on the implementation) at least a minimum weighted score threshold as the worker nodes to be used for snapshot volume placement.

When the relevant policy(s) specify that only one scheduler algorithm applies, the scheduler 264, in step 928, selects the worker node(s) stipulated by the output of the scheduler algorithm.

In step 930, the scheduler 264 updates the etcd database 112 with the selected worker node(s) for the snapshot volume and sets the snapshot volume to the scheduled state and the snapshot create request to the completed state.
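Steps 912 through 930 can be summarized in a short sketch. This is not the disclosed implementation: an in-memory dictionary stands in for the etcd database 112, the watch is replaced by a single pass over the stored requests, every selected node is assumed to score 1, and the names `schedule_pending_requests` and `run_algorithm` are hypothetical.

```python
def schedule_pending_requests(db, policies, run_algorithm):
    """Process pending snapshot requests and record the selected worker nodes.

    db is a plain dict standing in for the etcd database; run_algorithm(name,
    request) is a hypothetical callback returning the nodes an algorithm selects.
    """
    for request in db["snapshot_requests"]:
        if request["state"] != "pending":
            continue
        # Step 912: identify the policy; a weight of 0 disables an algorithm.
        policy = policies[request["policy"]]
        active = {name: w for name, w in policy.items() if w > 0}
        if not active:
            continue  # no usable algorithm; leave the request pending
        if len(active) > 1:
            # Steps 920-924: collect each algorithm's nodes and weight them.
            totals = {}
            for name, weight in active.items():
                for node in run_algorithm(name, request):
                    totals[node] = totals.get(node, 0) + weight
            # Step 928: keep the node(s) with the highest weighted score.
            best = max(totals.values()) if totals else None
            selected = sorted(n for n, s in totals.items() if s == best)
        else:
            # Step 928 (single algorithm): use its output directly.
            (name,) = active
            selected = sorted(run_algorithm(name, request))
        if not selected:
            continue
        # Step 930: record the placement and update both states.
        db["snapshot_volumes"][request["volume"]] = {
            "nodes": selected,
            "state": "scheduled",
        }
        request["state"] = "completed"


db = {
    "snapshot_requests": [{"volume": "snap-1", "policy": "custom", "state": "pending"}],
    "snapshot_volumes": {},
}
policies = {"custom": {"BalanceStorageResourceUsage": 1, "SkipInitiator": 10}}
outputs = {
    "BalanceStorageResourceUsage": ["node-1", "node-2"],
    "SkipInitiator": ["node-1", "node-3"],
}
schedule_pending_requests(db, policies, lambda name, request: outputs[name])
print(db["snapshot_volumes"]["snap-1"])  # {'nodes': ['node-1'], 'state': 'scheduled'}
```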

In response to the snapshot volume being in the scheduled state, the distributed storage controller 244, in step 1004, sends a quiesce request to the worker node(s) 300 where the application 148 corresponding to the snapshot volume is executing. The distributed storage controller 244 watches the etcd database 112 for any snapshot volumes that are in the scheduled state.

In step 1008, the distributed storage controller 244 receives a response from the node agent 304 in the worker node(s) 300 after quiescing.

The distributed storage controller 244, in step 1012, sends a snapshot create request to the worker node(s) selected by the scheduler 264 based on the relevant placement policies.

In step 1016, the node agent 304 of each selected worker node 300 interacts with the quiesced worker node(s) and/or any worker node where the application volume for the application is stored (e.g., mirrored) to create a snapshot volume 312 and returns a response indicating that the snapshot volume 312 has been created.

In step 1020, the distributed storage controller 244 sends a resume request to the worker node(s) where the application is executing to resume execution.

In step 1024, the distributed storage controller 244 updates the database 112 to set the snapshot volume to the available state.
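The ordering of steps 1004 through 1024 (quiesce, create, resume, mark available) can be sketched as below. The class `RecordingNodeAgent`, the function `create_scheduled_snapshot`, and the dictionary standing in for the database 112 are hypothetical; the stub agent only records the requests it receives rather than performing any storage operation.

```python
class RecordingNodeAgent:
    """Stand-in for a node agent; it only logs the requests it receives."""

    def __init__(self, node, log):
        self.node = node
        self.log = log

    def quiesce(self):
        self.log.append((self.node, "quiesce"))

    def create_snapshot(self, volume):
        self.log.append((self.node, "create " + volume))

    def resume(self):
        self.log.append((self.node, "resume"))


def create_scheduled_snapshot(db, volume, agents, initiator_nodes):
    """Drive one scheduled snapshot volume to the available state."""
    entry = db["snapshot_volumes"][volume]
    if entry["state"] != "scheduled":
        return False
    for node in initiator_nodes:      # steps 1004-1008: quiesce the application
        agents[node].quiesce()
    for node in entry["nodes"]:       # steps 1012-1016: create on selected nodes
        agents[node].create_snapshot(volume)
    for node in initiator_nodes:      # step 1020: resume the application
        agents[node].resume()
    entry["state"] = "available"      # step 1024
    return True


log = []
agents = {name: RecordingNodeAgent(name, log) for name in ("node-1", "node-2")}
store = {"snapshot_volumes": {"snap-1": {"nodes": ["node-2"], "state": "scheduled"}}}
done = create_scheduled_snapshot(store, "snap-1", agents, initiator_nodes=["node-1"])
print(log)
# [('node-1', 'quiesce'), ('node-2', 'create snap-1'), ('node-1', 'resume')]
```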

The exemplary systems and methods of this invention have been described in relation to cloud computing. However, to avoid unnecessarily obscuring the present invention, the preceding description omits a number of known structures and devices. This omission is not to be construed as a limitation of the scope of the claimed invention. Specific details are set forth to provide an understanding of the present invention. It should however be appreciated that the present invention may be practiced in a variety of ways beyond the specific detail set forth herein.

Furthermore, while the exemplary embodiments illustrated herein show the various components of the system collocated, certain components of the system can be located remotely, at distant portions of a distributed network, such as a LAN and/or the Internet, or within a dedicated system. Thus, it should be appreciated that the components of the system can be combined into one or more devices, such as a server, or collocated on a particular node of a distributed network, such as an analog and/or digital telecommunications network, a packet-switched network, or a circuit-switched network. It will be appreciated from the preceding description, and for reasons of computational efficiency, that the components of the system can be arranged at any location within a distributed network of components without affecting the operation of the system. For example, the various components can be located in a switch such as a PBX and media server, gateway, in one or more communications devices, at one or more users' premises, or some combination thereof. Similarly, one or more functional portions of the system could be distributed between a telecommunications device(s) and an associated computing device.

Furthermore, it should be appreciated that the various links connecting the elements can be wired or wireless links, or any combination thereof, or any other known or later developed element(s) that is capable of supplying and/or communicating data to and from the connected elements. These wired or wireless links can also be secure links and may be capable of communicating encrypted information. Transmission media used as links, for example, can be any suitable carrier for electrical signals, including coaxial cables, copper wire and fiber optics, and may take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Also, while the flowcharts have been discussed and illustrated in relation to a particular sequence of events, it should be appreciated that changes, additions, and omissions to this sequence can occur without materially affecting the operation of the invention.

A number of variations and modifications of the invention can be used. It would be possible to provide for some features of the invention without providing others.

In one embodiment, the systems and methods of this invention can be implemented in conjunction with a special purpose computer, a programmed microprocessor or microcontroller and peripheral integrated circuit element(s), an ASIC or other integrated circuit, a digital signal processor, a hard-wired electronic or logic circuit such as a discrete element circuit, a programmable logic device or gate array such as a PLD, PLA, FPGA, or PAL, any comparable means, or the like. In general, any device(s) or means capable of implementing the methodology illustrated herein can be used to implement the various aspects of this invention. Exemplary hardware that can be used for the present invention includes computers, handheld devices, telephones (e.g., cellular, Internet enabled, digital, analog, hybrids, and others), and other hardware known in the art. Some of these devices include processors (e.g., a single or multiple microprocessors), memory, nonvolatile storage, input devices, and output devices. Furthermore, alternative software implementations including, but not limited to, distributed processing or component/object distributed processing, parallel processing, or virtual machine processing can also be constructed to implement the methods described herein.

In yet another embodiment, the disclosed methods may be readily implemented in conjunction with software using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer or workstation platforms. Alternatively, the disclosed system may be implemented partially or fully in hardware using standard logic circuits or VLSI design. Whether software or hardware is used to implement the systems in accordance with this invention is dependent on the speed and/or efficiency requirements of the system, the particular function, and the particular software or hardware systems or microprocessor or microcomputer systems being utilized.

In yet another embodiment, the disclosed methods may be partially implemented in software that can be stored on a storage medium and executed on a programmed general-purpose computer with the cooperation of a controller and memory, a special purpose computer, a microprocessor, or the like. In these instances, the systems and methods of this invention can be implemented as a program embedded on a personal computer such as an applet, JAVA® or CGI script, as a resource residing on a server or computer workstation, as a routine embedded in a dedicated measurement system, system component, or the like. The system can also be implemented by physically incorporating the system and/or method into a software and/or hardware system.

Although the present invention describes components and functions implemented in the embodiments with reference to particular standards and protocols, the invention is not limited to such standards and protocols. Other similar standards and protocols not mentioned herein are in existence and are considered to be included in the present invention. Moreover, the standards and protocols mentioned herein and other similar standards and protocols not mentioned herein are periodically superseded by faster or more effective equivalents having essentially the same functions. Such replacement standards and protocols having the same functions are considered equivalents included in the present invention.

The present invention, in various embodiments, configurations, and aspects, includes components, methods, processes, systems and/or apparatus substantially as depicted and described herein, including various embodiments, subcombinations, and subsets thereof. Those of skill in the art will understand how to make and use the present invention after understanding the present disclosure. The present invention, in various embodiments, configurations, and aspects, includes providing devices and processes in the absence of items not depicted and/or described herein or in various embodiments, configurations, or aspects hereof, including in the absence of such items as may have been used in previous devices or processes, e.g., for improving performance, achieving ease and/or reducing cost of implementation.

The foregoing discussion of the invention has been presented for purposes of illustration and description. The foregoing is not intended to limit the invention to the form or forms disclosed herein. In the foregoing Detailed Description for example, various features of the invention are grouped together in one or more embodiments, configurations, or aspects for the purpose of streamlining the disclosure. The features of the embodiments, configurations, or aspects of the invention may be combined in alternate embodiments, configurations, or aspects other than those discussed above. This method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment, configuration, or aspect. Thus, the following claims are hereby incorporated into this Detailed Description, with each claim standing on its own as a separate preferred embodiment of the invention.

Moreover, though the description of the invention has included description of one or more embodiments, configurations, or aspects and certain variations and modifications, other variations, combinations, and modifications are within the scope of the invention, e.g., as may be within the skill and knowledge of those in the art, after understanding the present disclosure. It is intended to obtain rights which include alternative embodiments, configurations, or aspects to the extent permitted, including alternate, interchangeable and/or equivalent structures, functions, ranges or steps to those claimed, whether or not such alternate, interchangeable and/or equivalent structures, functions, ranges or steps are disclosed herein, and without intending to publicly dedicate any patentable subject matter. 

What is claimed is:
 1. A method, comprising: receiving, by a processor, a request to create a volume in accordance with a volume placement policy that comprises a plurality of scheduler algorithms, each scheduler algorithm of the plurality of scheduler algorithms to select one or more worker nodes from a plurality of worker nodes for volume storage; determining, by the processor based on output from the plurality of scheduler algorithms, the one or more worker nodes, wherein output of each of the plurality of scheduler algorithms is assigned a weight in determining the one or more worker nodes; and causing, by the processor, a node agent in each of the one or more worker nodes to create the volume.
 2. The method of claim 1, wherein the volume comprises a snapshot volume, wherein the snapshot volume comprises a replicated volume, wherein the plurality of worker nodes are associated with a plurality of clusters corresponding to a common tenant, wherein weights assigned to the plurality of scheduler algorithms are the same, wherein at least one scheduler algorithm other than the plurality of scheduler algorithms is assigned a weight less than the weights assigned to the plurality of scheduler algorithms, wherein the volume placement policy selects one or more clusters from among the plurality of clusters for volume creation, and wherein the plurality of scheduler algorithms comprise a plurality of: a least recently used scheduler algorithm, a balance storage resource scheduler algorithm, a skip initiator scheduler algorithm, a most frequently used scheduler algorithm, a selector based scheduler algorithm, an all plexes scheduler algorithm, and a round robin scheduler algorithm.
 3. The method of claim 1, wherein the volume comprises a mirrored application volume, wherein the plurality of worker nodes are associated with a common cluster corresponding to a common tenant, wherein weights assigned to the plurality of scheduler algorithms are different, and wherein the plurality of scheduler algorithms comprise a plurality of: a least recently used scheduler algorithm, a balance storage resource scheduler algorithm, a skip initiator scheduler algorithm, a most frequently used scheduler algorithm, a selector based scheduler algorithm, an all plexes scheduler algorithm, and a round robin scheduler algorithm.
 4. The method of claim 1, wherein the volume placement policy comprises first and second placement policies and wherein the determining comprises: determining, by the processor based on an output of the first placement policy, a target cluster from among a plurality of clusters for volume storage; and determining, by the processor based upon an output of the second placement policy, the one or more worker nodes within the target cluster for volume storage, the one or more worker nodes being part of the target cluster.
 5. The method of claim 4, wherein the first and second placement policies assign weights to different scheduler algorithms.
 6. The method of claim 4, wherein the first and second placement policies assign different weights to common scheduler algorithms.
 7. The method of claim 1, wherein the volume comprises a snapshot volume, wherein the plurality of scheduler algorithms in the volume placement policy perform one or more of: limits a number of mirrors for snapshots, minimizes storage space needed for a snapshot, maximizes snapshot storage distribution across the plurality of worker nodes, dedicates the one or more worker nodes for a plurality of snapshots, distributes snapshot volume storage substantially uniformly across the plurality of worker nodes, distributes snapshot volume storage giving more priority to worker nodes in the plurality of worker nodes that have more storage space available, selects the worker node used for creation and lifecycle management of snapshot volumes to maintain Service Level Agreements (SLAs) for an applications corresponding to the volume, selects worker node storage resources to achieve a predetermined Recovery Point Objective (RPO) and/or Recovery Time Objective (RTO) for application recovery with the processor automatically deciding on placement and type of snapshot volume(s) to be created depending on the RPO and/or RTO to be achieved, wherein a first worker node of the plurality of worker nodes comprises a first weighted score, a second worker node of the plurality of worker nodes comprises a different second weighted score, and wherein the one or more worker nodes comprises the first but not the second worker node.
 8. A server comprising: a communication interface to transmit and receive communications; a processor coupled with the communication interface; and a computer readable medium, coupled with and readable by the processor and storing therein a set of instructions that, when executed by the processor, causes the processor to: receive a request to create a volume, wherein a volume placement policy comprises a plurality of scheduler algorithms, each scheduler algorithm of the plurality of scheduler algorithms to select one or more worker nodes from among a plurality of worker nodes for volume storage; determine, based on an output from the plurality of scheduler algorithms, the one or more worker nodes, wherein an output of each of the plurality of scheduler algorithms is assigned a weight in determining the one or more worker nodes; and instruct a node agent in each of the one or more worker nodes to create the volume.
 9. The server of claim 8, wherein the volume comprises a snapshot volume, wherein the snapshot volume comprises a replicated volume, wherein the plurality of worker nodes are associated with a plurality of clusters corresponding to a common tenant, wherein weights assigned to the plurality of scheduler algorithms are the same, wherein at least one scheduler algorithm other than the plurality of scheduler algorithms is assigned a weight different than the weights assigned to the plurality of scheduler algorithms, wherein the volume placement policy selects one or more clusters from among the plurality of clusters for volume creation, and wherein the plurality of scheduler algorithms comprise a plurality of: a least recently used scheduler algorithm, a balance storage resource scheduler algorithm, a skip initiator scheduler algorithm, a most frequently used scheduler algorithm, a selector based scheduler algorithm, an all plexes scheduler algorithm, and a round robin scheduler algorithm.
 10. The server of claim 8, wherein the volume comprises a mirrored application volume, wherein the plurality of worker nodes are associated with a common cluster corresponding to a common tenant, wherein weights assigned to the plurality of scheduler algorithms are different, and wherein the plurality of scheduler algorithms comprise a plurality of: a least recently used scheduler algorithm, a balance storage resource scheduler algorithm, a skip initiator scheduler algorithm, a most frequently used scheduler algorithm, a selector based scheduler algorithm, an all plexes scheduler algorithm, and a round robin scheduler algorithm.
 11. The server of claim 8, wherein the volume placement policy comprises first and second placement policies and wherein determining the one or more worker nodes comprises: determining, based on an output of the first placement policy, a target cluster from among a plurality of clusters for volume storage; and determining, based upon an output of the second placement policy, the one or more worker nodes within the target cluster for volume storage, the one or more worker nodes being part of the target cluster.
 12. The server of claim 11, wherein the first and second placement policies assign weights to different scheduler algorithms.
 13. The server of claim 11, wherein the first and second placement policies assign different weights to common scheduler algorithms.
 14. The server of claim 8, wherein the volume comprises a snapshot volume, wherein the plurality of scheduler algorithms in the volume placement policy perform one or more of: limits a number of mirrors for snapshots, minimizes storage space needed for a snapshot, maximizes snapshot storage distribution across the plurality of worker nodes, dedicates the one or more worker nodes for a plurality of snapshots, distributes snapshot volume storage substantially uniformly across the plurality of worker nodes, distributes snapshot volume storage giving more priority to worker nodes in the plurality of worker nodes that have more storage space available, selects the worker node used for creation and lifecycle management of snapshot volumes to maintain Service Level Agreements (SLAs) for an applications corresponding to the volume, selects worker node storage resources to achieve a predetermined Recovery Point Objective (RPO) and/or Recovery Time Objective (RTO) for application recovery with the processor automatically deciding on placement and type of snapshot volume(s) to be created depending on the RPO and/or RTO to be achieved, wherein a first worker node of the plurality of worker nodes comprises a first weighted score, a second worker node of the plurality of worker nodes comprises a different second weighted score, and wherein the one or more worker nodes comprises the first but not the second worker node.
 15. A method, comprising: receiving, by a processor, a request to create a volume in accordance with a volume placement policy that comprises a plurality of scheduler algorithms, each scheduler algorithm of the plurality of scheduler algorithms to select one or more clusters of worker nodes from among a plurality of worker node clusters corresponding to a common tenant for volume storage; determining, by a processor based on an output from the plurality of scheduler algorithms, the one or more worker node clusters, wherein an output of each of the plurality of scheduler algorithms is assigned a weight in determining the one or more worker node clusters; and causing, by the processor, a node agent in each of the one or more worker node clusters to create the volume on a worker node in each of the one or more worker node clusters.
 16. The method of claim 15, wherein the volume comprises a snapshot volume, wherein the snapshot volume comprises a replicated volume, wherein a plurality of worker nodes are associated with the plurality of worker node clusters, wherein weights assigned to the plurality of scheduler algorithms are the same, wherein at least one scheduler algorithm other than the plurality of scheduler algorithms is assigned a weight less than the weights assigned to the plurality of scheduler algorithms, wherein the volume placement policy selects one or more worker nodes within a common cluster from among the plurality of worker nodes for volume creation, and wherein the plurality of scheduler algorithms comprise a plurality of: a least recently used scheduler algorithm, a balance storage resource scheduler algorithm, a skip initiator scheduler algorithm, a most frequently used scheduler algorithm, a selector based scheduler algorithm, an all plexes scheduler algorithm, and a round robin scheduler algorithm.
 17. The method of claim 15, wherein the volume comprises a mirrored application volume, wherein the plurality of worker node clusters are associated with a plurality of worker nodes, wherein weights assigned to the plurality of scheduler algorithms are different, and wherein the plurality of scheduler algorithms comprise a plurality of: a least recently used scheduler algorithm, a balance storage resource scheduler algorithm, a skip initiator scheduler algorithm, a most frequently used scheduler algorithm, a selector based scheduler algorithm, an all plexes scheduler algorithm, and a round robin scheduler algorithm.
 18. The method of claim 15, wherein the volume placement policy comprises first and second placement policies and wherein the determining of the one or more worker node clusters comprises: the processor determining, based on an output of the first placement policy, a target worker node cluster from among a plurality of worker node clusters for volume storage; and the processor determining, based upon an output of the second placement policy, one or more worker nodes within the target cluster for volume storage, the one or more worker nodes being part of the target cluster.
 19. The method of claim 18, wherein the first and second placement policies assign weights to different scheduler algorithms.
 20. The method of claim 18, wherein the first and second placement policies assign different weights to common scheduler algorithms.
 21. The method of claim 15, wherein the volume comprises a snapshot volume, wherein the plurality of scheduler algorithms in the volume placement policy perform one or more of: limits a number of mirrors for snapshots, minimizes storage space needed for a snapshot, maximizes snapshot storage distribution across a plurality of worker nodes, dedicates one or more worker nodes for a plurality of snapshots, distributes snapshot volume storage substantially uniformly across the plurality of worker nodes, distributes snapshot volume storage giving more priority to worker nodes in the plurality of worker nodes that have more storage space available, selects the worker node used for creation and lifecycle management of snapshot volumes to maintain Service Level Agreements (SLAs) for an applications corresponding to the volume, selects worker node storage resources to achieve a predetermined Recovery Point Objective (RPO) and/or Recovery Time Objective (RTO) for application recovery with the processor automatically deciding on placement and type of snapshot volume(s) to be created depending on the RPO and/or RTO to be achieved, wherein a first worker node cluster of the plurality of worker node clusters comprises a first weighted score, a second worker node cluster of the plurality of worker node clusters comprises a different second weighted score, and wherein the one or more worker node clusters comprises the first but not the second worker node cluster. 